Credit Card Customer Churn Prediction Engine


Background & Context

Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead the bank to losses, so the bank wants to analyze its customer data, identify the customers who are likely to leave the credit card service, and understand the reasons why, so that the bank can improve in those areas.

As a Data Scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

You need to identify the best possible model that will give the required performance

Objective

  1. Explore and visualize the dataset.
  1. Build a classification model to predict whether a customer is going to churn or not.
  1. Optimize the model using appropriate techniques.
  1. Generate a set of insights and recommendations that will help the bank.

Data Dictionary:

CLIENTNUM: Client number. Unique identifier for the customer holding the account

Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"

Customer_Age: Age in Years

Gender: Gender of the account holder

Dependent_count: Number of dependents

Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate.

Marital_Status: Marital Status of the account holder

Income_Category: Annual Income Category of the account holder

Card_Category: Type of Card

Months_on_book: Period of relationship with the bank

Total_Relationship_Count: Total no. of products held by the customer

Months_Inactive_12_mon: No. of months inactive in the last 12 months

Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months

Credit_Limit: Credit Limit on the Credit Card

Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance

Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)

Total_Trans_Amt: Total Transaction Amount (Last 12 months)

Total_Trans_Ct: Total Transaction Count (Last 12 months)

Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in the 4th quarter to that in the 1st quarter

Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in the 4th quarter to that in the 1st quarter

Avg_Utilization_Ratio: Represents how much of the available credit the customer spent


Importing all required Libraries:

Installing imblearn package

Loading and exploring the data

There are 10127 rows and 21 columns in the dataset
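A minimal sketch of the load-and-inspect step. The actual CSV path is not shown in this write-up, so the example builds a tiny illustrative frame with columns of the same kinds; the commented `read_csv` line marks where the real file would be loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical loading step -- the real file name is not given here:
# data = pd.read_csv("BankChurners.csv")

# Tiny illustrative frame standing in for the 10127 x 21 dataset
data = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008, 713982108],
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
    "Customer_Age": [45, 49, 51],
    "Education_Level": ["High School", np.nan, "Graduate"],
})

print(data.shape)               # (rows, columns)
print(data.duplicated().sum())  # duplicate-record count
print(data.isnull().sum())      # missing values per column
print((data.isnull().mean() * 100).round(1))  # missing %, per column
```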

Checking Missing Values

Observations from the given dataset

  1. There are 10127 records and 21 features in the dataset.
  1. There are no duplicate records.
  1. Our target variable will be Attrition_Flag.
  1. Education_Level has 15% and Marital_Status has 7.4% Missing values.
  1. Average Customer_Age is 46, Minimum is 26 and Maximum is 73.
  1. The average customer relationship with the bank is nearly 3 years, with a minimum of just over a year and a maximum of nearly 5 years (56 months).
  1. On average, customers hold nearly 4 products.
  1. Mean Credit_Limit is 8631 while the median is 4549, indicating that some customers have very high credit limits.
  1. Total_Trans_Amt has an average of 4404 and a median of 3899, indicating the data is right-skewed with outliers at the higher end.
  1. On average, customers make more transactions in the first quarter than in the fourth; the Q4 : Q1 transaction-count ratio is about 7 : 10.
  1. Credit-limit utilization is low: 75% of customers use at most 75% of their credit limit, and the average utilization is just 27%.

Exploring Data

Deleting CLIENTNUM
Converting the following Columns into Categorical
Looping through the Categorical Features and exploring each category count
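The conversion-and-count loop can be sketched as below; the column list here is an illustrative subset, not the notebook's full set of categorical features.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F"],
    "Card_Category": ["Blue", "Blue", "Silver", "Blue", "Gold"],
    "Marital_Status": ["Married", "Single", "Married", "Divorced", "Single"],
})

cat_cols = ["Gender", "Card_Category", "Marital_Status"]

# Convert each column to the memory-efficient pandas category dtype
for col in cat_cols:
    df[col] = df[col].astype("category")

# Explore the category counts, one feature at a time
for col in cat_cols:
    print(f"--- {col} ---")
    print(df[col].value_counts())
```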

Observations:

  1. More than 1500 records are missing Education_Level and almost 750 are missing Marital_Status.
  1. Income_Category has more than 1000 non-numeric junk values ('abc') that need to be imputed.
  1. Some prefix and suffix treatments need to be made on Income_Category values too.
  1. Most customers use the Blue Card.
  1. A total of 1627 customers attrited.
  1. Most customers have 2 or 3 dependents.
  1. Male and female customer counts are very close, with females slightly higher in number.
  1. Less than 10% of customers use just 1 product; most hold 3 products.

Univariate Analysis

A function to plot a histogram and boxplot together for each Quantitative feature
Looping through all the Numerical Columns and plotting them sequentially

Observations

Let's analyze the Categorical data

Observations

Bivariate and Multivariate analysis

Let's plot the correlations that are -

Greater than 0.6

Less than -0.1

Let's see how the Pairplot looks

Observations

Let's find out how Customer Attrition is distributed across the Features

A function to show the distribution of the target variable Attrition Yes or No based on other features

Observation:

Creating a new column to bucket Customer Age for better data analysis
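Bucketing could be done with `pd.cut`; the bin edges and labels below are assumptions for illustration, not necessarily the notebook's actual buckets.

```python
import pandas as pd

ages = pd.Series([26, 34, 41, 47, 55, 62, 73], name="Customer_Age")

# Assumed 10-year buckets covering the observed age range (26 to 73)
age_bucket = pd.cut(
    ages,
    bins=[25, 35, 45, 55, 65, 75],
    labels=["26-35", "36-45", "46-55", "56-65", "66-75"],
)
print(age_bucket.tolist())
```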

Statistics of Categorical features vs Attrition

Observations:

Customer Segmentation based on Product

Exploring the statistical distribution of various Features of Attrited Customers

Conclusions from EDA

Churned Customer Profiling (Based on Product - Card Type)

Blue Card

Gold Card

Silver Card

Platinum card

Data Preprocessing

Outlier Treatment

Finding and printing the Outlier % for each feature
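A common way to compute an outlier percentage is the 1.5 x IQR (boxplot-whisker) rule; a sketch, assuming that is the rule used here:

```python
import pandas as pd

def outlier_pct(series: pd.Series) -> float:
    """Percent of values outside the 1.5 * IQR whiskers."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = (series < lower) | (series > upper)
    return round(100 * mask.mean(), 2)

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])  # 100 is an obvious outlier
print(outlier_pct(s))  # → 10.0
```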

Let's explore Outliers in the following Attributes due to their higher percentage:

Let's check out Credit_Limit
Let's check out Avg_Open_To_Buy
Let's check out Total_Trans_Amt

Observations from Outliers

Missing-Value Treatment

Encoding categorical features to Numeric for KNN Imputation

Data Preparation for Modeling

Splitting the Dataset

Imputing Missing Values

Decoding the Encoded values back to the Original

Checking inverse mapped values/categories.
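The encode → KNN-impute → decode round trip can be sketched as follows, on synthetic data with one categorical column (the real notebook encodes several):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Education_Level": ["Graduate", "High School", np.nan, "Graduate", "Doctorate"],
    "Customer_Age": [45, 33, 40, 47, 60],
})

# Encode the categorical column as integer codes, keeping NaN as NaN
# (pd.Categorical uses -1 for missing, so map -1 back to NaN first)
cats = pd.Categorical(df["Education_Level"])
codes = pd.Series(cats.codes, dtype="float").replace(-1, np.nan)
enc = pd.DataFrame({"Education_Level": codes, "Customer_Age": df["Customer_Age"]})

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(enc), columns=enc.columns)

# Round imputed codes to valid categories, then decode back to labels
decoded = imputed["Education_Level"].round().astype(int).map(
    dict(enumerate(cats.categories))
)
print(decoded.tolist())
```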

Creating Dummy Variables

Building the model

Model evaluation criterion:

Model can make wrong predictions as:

Which case is more important?

How to reduce this loss i.e need to reduce False Negatives?
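One standard way to reduce False Negatives is to lower the classification threshold so more borderline customers get flagged as churners, trading some precision for recall. A sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the dataset's 84/16 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.84, 0.16], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_va)[:, 1]

# Default 0.5 threshold vs a lower 0.3 threshold
recall_default = recall_score(y_va, (proba >= 0.5).astype(int))
recall_low_thr = recall_score(y_va, (proba >= 0.3).astype(int))
print(recall_default, recall_low_thr)
```

Lowering the threshold can only grow the set of predicted positives, so recall never decreases; the cost shows up as extra False Positives.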

Developing some Functions to evaluate Models:

Function to plot a Confusion Matrix
Creating a Function to evaluate Model Performance on various Parameters
Visualizing the Decision Tree (Default Decision Tree Model)
Visualizing Feature Importance

Let's start by building different models using KFold and cross_val_score

Let's build the following Models on the Training Data and Predict on the Validation Data

  1. LogisticRegression
  1. Decision Tree
  1. RandomForest
  1. Bagging
  1. AdaBoost
  1. GradientBoost
  1. XGBoost

Performing Model Analysis of all 7 Models using Training Data

Model Performance on Normal Training Data and Predicting on Validation Set

Oversampling train data using SMOTE

Performing Model Analysis of all 7 Models using Over Sampled Training Data

Model Performance on Oversampled Training Data and Predicting on Validation Set

Undersampling train data using Random Under Sampler

Performing Model Analysis of all 7 Models using Under Sampled Training Data

Model Performance on Undersampled Training Data and Predicting on Validation Set

Creating a Model Score Sheet to put together all 21 (7 x 3) Model Scores in one place

Conclusion from comparison of all the Models

Logistic Regression performance is very low. Let's try Regularization and see how it behaves

Regularization:

Regularization is the process that shrinks the coefficients towards zero. In simple words, regularization discourages learning a more complex or flexible model, to prevent overfitting.

Main Regularization Techniques

  1. Ridge Regression (L2 Regularization) - Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function.

  2. Lasso Regression (L1 Regularization) - Lasso adds the "absolute magnitude" of the coefficients as a penalty term to the loss function.

  1. Using Training Data to tune a Logistic Regression Model along with Regularization
  1. Using Over Sampled Training Data to tune a Logistic Regression Model along with Regularization
  1. Using Under Sampled Training Data to tune a Logistic Regression Model along with Regularization
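A sketch of tuning a regularized Logistic Regression over penalty type (L1/L2) and strength C, on synthetic data; the notebook's actual grid may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, weights=[0.84, 0.16], random_state=3)

# The liblinear solver supports both L1 (Lasso-style) and L2 (Ridge-style)
# penalties; smaller C means stronger regularization
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid, scoring="recall", cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```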

Observation from Regularization:

  1. We tried Regularization on the Logistic Regression Model as it was performing poorly earlier.
  1. a. Tuned and Regularized on Training Data and Predicted on Validation Set.

    b. Tuned and Regularized on Over Sampled Training Data and Predicted on Validation Set.

    c. Tuned and Regularized on Under Sampled Training Data and Predicted on Validation Set.

  1. Among the above 3, the Under Sampled Regularized model generalized well on the training and validation sets. Our recall after undersampling on the validation set was better than our recall after oversampling on the test set.
  1. Received a Recall score of 84% on the Under Sampled Training Data and 85% on the Validation set, which is better, but we have other models that performed better than this.

Hyperparameter tuning using Random search

Random Search: Define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.
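A RandomizedSearchCV sketch for AdaBoost on synthetic data; the actual search space used in the notebook may differ.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, weights=[0.84, 0.16], random_state=5)

# Bounded domains to sample from, per the Random Search definition above
param_dist = {
    "n_estimators": randint(50, 200),
    "learning_rate": uniform(0.01, 1.0),
}
search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=5),
    param_distributions=param_dist,
    n_iter=5, scoring="recall", cv=3, random_state=5,
)
search.fit(X, y)
print(search.best_params_)
```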

AdaBoost

Conclusion from Tuned AdaBoost Model

Gradient Boost

Conclusion from Tuned Gradient Boost Model

XGBoost

Conclusion from Tuned XGBoost Model

Model Performance comparison

Observations

Performance on the test set

Conclusion from Model Performance on Test Data

* The RandomizedSearch-tuned XGBoost model performed extremely well on the unseen Test Data.
* Only 9 out of 2026 records are predicted incorrectly as False Negatives, which is just 0.44%.
* It produced a very high Recall of 97%.
* 7.45% of records are False Positives, where the model predicted these customers would attrite but they ultimately did not.
* We will go ahead and productionize this Model using a Pipeline.

Productionize the model

Pipelines for productionizing the model

Column Transformer

Now that we already know the best model to proceed with, we don't need to divide the data into 3 sets - train, validation, and test

Creating a Pipeline with the Best Model and Preprocessor
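A sketch of such a pipeline on a tiny illustrative frame. GradientBoostingClassifier stands in for the tuned XGBoost model here (xgboost is a separate package); the column names mimic the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in for the full churn dataset
df = pd.DataFrame({
    "Customer_Age": [45, 33, 40, 51, 60, 38],
    "Credit_Limit": [12000.0, 3000.0, 4500.0, 9000.0, 20000.0, 5000.0],
    "Gender": ["M", "F", "F", "M", "M", "F"],
    "Card_Category": ["Blue", "Blue", "Silver", "Blue", "Gold", "Blue"],
})
y = [0, 1, 0, 0, 1, 0]

num_cols = ["Customer_Age", "Credit_Limit"]
cat_cols = ["Gender", "Card_Category"]

# Numeric columns: impute then scale; categorical columns: one-hot encode
preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

pipe = Pipeline([
    ("prep", preprocessor),
    ("model", GradientBoostingClassifier(random_state=0)),
])
pipe.fit(df, y)
print(pipe.predict(df))
```

Bundling preprocessing and model in one object means the exact same transformations are applied at training and at prediction time, which is the point of productionizing via a pipeline.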

Actionable Insights & Recommendations

  • A lower transaction count, lower revolving balance, and lower transaction amount on the credit card are indications that a customer will attrite. This suggests --

    • the customer is not using this credit card,
    • the bank should offer more rewards, cashback, or other offers to attract the customer.
  • From the EDA, the more products a customer holds with the bank, the less likely he/she is to attrite. The bank can offer more products to such customers so they buy more, which will help retain them.
  • Customers who have been inactive for a month show high chances of attrition. The bank should focus on such customers as well.
  • The average utilization ratio is lower among attrited customers.
  • Customers who are

    • in the age range 36-55,
    • Doctorate or Post-Graduate holders,
    • or Female attrited more.

    One of the reasons could be that a competing bank is offering them better deals, leading to less use of this bank's credit card.

  • Customers who had more contact with the bank in the last 12 months have attrited more. It needs to be investigated whether there were unresolved customer issues that led to customers leaving the bank.